Import the route information

The file clean_routes.csv gives a description of all the routes (the names of the observation sites in the BBS data) as well as some predictor variables. It does not contain the presence data for the bird species; that is stored in bbs.csv, which is 4.6 GB, so it’s not a good idea to open and close that file frequently XD. The route_id column appears in both clean_routes and bbs, and it is a unique identifier for every route in the dataset.
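Since route_id is the shared key, a toy example shows how the presence data would be joined onto the route predictors. Only the route_id column name is taken from the real files; the other columns here are made up for illustration:

```r
# Toy data: every column except route_id is invented for this sketch.
routes  <- data.frame(route_id  = c("02_001", "02_004"),
                      elevation = c(100, 200))
bbs_toy <- data.frame(route_id = c("02_001", "02_001", "02_004"),
                      species  = c("sp1", "sp2", "sp1"))
# Each presence record picks up its route's predictors via the key.
merged <- merge(bbs_toy, routes, by = "route_id")
nrow(merged)  # 3
```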

library(dplyr)
library(ggplot2)
library(sf)       # for coord_sf() / st_crs()

clean_routes <- read.csv('data/cleaned/clean_routes.csv')
str(clean_routes)
## 'data.frame':    4575 obs. of  24 variables:
##  $ route_id    : Factor w/ 4575 levels "02_001","02_004",..: 463 471 513 472 514 515 516 517 518 519 ...
##  $ L3_KEY      : Factor w/ 131 levels "0.0.0  Water",..: 78 77 76 78 78 67 67 78 66 65 ...
##  $ L2_KEY      : Factor w/ 32 levels "0.0  WATER","10.1  COLD DESERTS",..: 22 22 22 22 22 21 21 22 21 21 ...
##  $ L1_KEY      : Factor w/ 14 levels "0  WATER","10  NORTH AMERICAN DESERTS",..: 12 12 12 12 12 11 11 12 11 11 ...
##  $ road_den    : num  176.2 174.1 55.3 586.8 0 ...
##  $ c_road_den  : num  0.102 0.095 -0.588 0.824 -3.012 ...
##  $ dist_shore  : num  2160 87549 6097 29296 22726 ...
##  $ group       : Factor w/ 3 levels "test","train",..: 2 2 2 2 2 1 1 2 1 1 ...
##  $ PC1         : num  4.08 5.49 9.34 6.48 5.7 ...
##  $ PC2         : num  -1.33 -4.07 -8.12 -4.89 -4.53 ...
##  $ PC3         : num  -3.62 -2.59 -5.9 -2.82 -4.18 ...
##  $ PC4         : num  0.9177 0.3851 -2.2221 0.0703 0.0189 ...
##  $ PC5         : num  -1.72 -1.97 -5.68 -2.3 -3.07 ...
##  $ PC6         : num  1.62 1.35 2.46 1.31 1.37 ...
##  $ PC7         : num  0.263 0.213 1.419 0.139 0.542 ...
##  $ PC8         : num  -0.2243 0.0323 -1.9571 -0.2831 -0.7831 ...
##  $ elevation   : num  174.7 81.6 141.5 90.4 620.4 ...
##  $ Latitude    : num  48.8 49.3 48.6 49.2 49.5 ...
##  $ Longitude   : num  -124 -122 -124 -123 -125 ...
##  $ c_lat       : num  0.929 0.998 0.901 0.983 1.028 ...
##  $ c_lon       : num  -1.51 -1.4 -1.55 -1.45 -1.59 ...
##  $ adj_elev    : num  320 227 287 236 766 ...
##  $ c_elevation : num  -0.723 -1.177 -0.868 -1.127 0.429 ...
##  $ c_dist_shore: num  -3.701 -0.964 -2.934 -1.774 -1.961 ...

Inspect the routes

Let’s plot the routes on a map using Latitude and Longitude.

clean_routes %>% 
  ggplot(aes(x = Longitude , y = Latitude)) +
  geom_point(shape = 1) +
  coord_sf(crs = st_crs(4326))

route_id

The first two digits of route_id indicate the state in which the route is located. There are 61 states in the dataset.

head(clean_routes$route_id , 10)
##  [1] 11_001 11_010 11_102 11_011 11_114 11_116 11_118 11_126 11_131 11_133
## 4575 Levels: 02_001 02_004 02_006 02_007 02_008 02_009 02_010 02_012 ... 93_314
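The state code can also be pulled out without splitting the whole data frame; a minimal sketch on a few of the route_ids above:

```r
ids <- c("11_001", "11_010", "02_004")
# fixed-width prefix:
substr(ids, 1, 2)    # "11" "11" "02"
# or drop everything from the underscore on (robust to longer prefixes):
sub("_.*", "", ids)  # "11" "11" "02"
```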
clean_routes %>% 
  tidyr::separate(route_id , c("id1" , "id2") , sep = "_") %>% 
  ggplot(aes(x = Longitude , y = Latitude)) +
  geom_point(aes(color = id1) , shape = 1) +
  coord_sf(crs = st_crs(4326)) +
  labs(color = "state")

L2_KEY

L2_KEY encodes the EPA level 2 ecoregion.

unique(clean_routes$L2_KEY) %>% sort
##  [1] 0.0  WATER                                                
##  [2] 10.1  COLD DESERTS                                        
##  [3] 10.2  WARM DESERTS                                        
##  [4] 11.1  MEDITERRANEAN CALIFORNIA                            
##  [5] 12.1  WESTERN SIERRA MADRE PIEDMONT                       
##  [6] 13.1  UPPER GILA MOUNTAINS                                
##  [7] 15.4  EVERGLADES                                          
##  [8] 2.2  ALASKA TUNDRA                                        
##  [9] 2.3  BROOKS RANGE TUNDRA                                  
## [10] 2.4  SOUTHERN ARCTIC                                      
## [11] 3.1  ALASKA BOREAL INTERIOR                               
## [12] 3.2  TAIGA CORDILLERA                                     
## [13] 3.3  TAIGA PLAIN                                          
## [14] 3.4  TAIGA SHIELD                                         
## [15] 4.1  HUDSON PLAIN                                         
## [16] 5.1  SOFTWOOD SHIELD                                      
## [17] 5.2  MIXED WOOD SHIELD                                    
## [18] 5.3  ATLANTIC HIGHLANDS                                   
## [19] 5.4  BOREAL PLAIN                                         
## [20] 6.1  BOREAL CORDILLERA                                    
## [21] 6.2  WESTERN CORDILLERA                                   
## [22] 7.1  MARINE WEST COAST FOREST                             
## [23] 8.1  MIXED WOOD PLAINS                                    
## [24] 8.2  CENTRAL USA PLAINS                                   
## [25] 8.3  SOUTHEASTERN USA PLAINS                              
## [26] 8.4  OZARK/OUACHITA-APPALACHIAN FORESTS                   
## [27] 8.5  MISSISSIPPI ALLUVIAL AND SOUTHEAST USA COASTAL PLAINS
## [28] 9.2  TEMPERATE PRAIRIES                                   
## [29] 9.3  WEST-CENTRAL SEMIARID PRAIRIES                       
## [30] 9.4  SOUTH CENTRAL SEMIARID PRAIRIES                      
## [31] 9.5  TEXAS-LOUISIANA COASTAL PLAIN                        
## [32] 9.6  TAMAULIPAS-TEXAS SEMIARID PLAIN                      
## 32 Levels: 0.0  WATER 10.1  COLD DESERTS ... 9.6  TAMAULIPAS-TEXAS SEMIARID PLAIN
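Each L2_KEY bundles a numeric code and a name separated by whitespace, so both parts can be recovered with base-R regexes (shown here on one hard-coded level):

```r
key <- "10.1  COLD DESERTS"
regmatches(key, regexpr("\\d+\\.\\d+", key))  # "10.1" (the code)
sub("^\\d+\\.\\d+\\s+", "", key)              # "COLD DESERTS" (the name)
```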

We can plot the ecoregions of the routes on a map.

clean_routes %>% 
  mutate(L2_KEY = stringr::str_extract(L2_KEY , "\\d+\\.\\d+")) %>%  # str_extract returns a vector; str_match would return a matrix column
  ggplot(aes(x = Longitude , y = Latitude)) +
  geom_point(aes(color = L2_KEY) , shape = 1) +
  coord_sf(crs = st_crs(4326))


Training/validation/test set

The assignment to the training/validation/test sets is recorded in clean_routes in the column group. According to Joseph (2020):

To compare the performance of the three modelling approaches, the data were partitioned into a training, validation and test set at the EPA level 2 ecoregion (Roberts et al. 2017). All routes within an ecoregion were assigned to the same partition… For each of the three modelling approaches, routes in the training set were used for parameter estimation… Then, using the trained models, the mean predictive log-density of the validation data was evaluated to identify the best performing model… Finally, the best performing model was retrained using the training and validation data, and its predictive performance was evaluated on the withheld test set.

clean_routes %>% 
  mutate(group = factor(group , levels = c("train" , "validation" , "test"))) %>% 
  xtabs(~group , data = .)
## group
##      train validation       test 
##       2161        958       1456
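Note that the split is far from one third each; computing the shares from the counts above:

```r
counts <- c(train = 2161, validation = 958, test = 1456)
round(counts / sum(counts), 2)
#      train validation       test
#       0.47       0.21       0.32
```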

We can see how the training/validation/test partition lines up with the ecoregions:

knitr::kable({
  clean_routes %>% 
  mutate(group = factor(group , levels = c("train" , "validation" , "test"))) %>% 
  xtabs(~group + L2_KEY , data = .) %>% 
  as_tibble() %>% 
  tidyr::pivot_wider(names_from = group , values_from = n)
})
|L2_KEY | train| validation| test|
|:--|--:|--:|--:|
|0.0 WATER | 2| 0| 0|
|10.1 COLD DESERTS | 353| 0| 0|
|10.2 WARM DESERTS | 0| 122| 0|
|11.1 MEDITERRANEAN CALIFORNIA | 94| 0| 0|
|12.1 WESTERN SIERRA MADRE PIEDMONT | 0| 10| 0|
|13.1 UPPER GILA MOUNTAINS | 27| 0| 0|
|15.4 EVERGLADES | 0| 0| 10|
|2.2 ALASKA TUNDRA | 0| 11| 0|
|2.3 BROOKS RANGE TUNDRA | 0| 3| 0|
|2.4 SOUTHERN ARCTIC | 0| 0| 3|
|3.1 ALASKA BOREAL INTERIOR | 0| 0| 15|
|3.2 TAIGA CORDILLERA | 0| 0| 8|
|3.3 TAIGA PLAIN | 16| 0| 0|
|3.4 TAIGA SHIELD | 0| 14| 0|
|4.1 HUDSON PLAIN | 3| 0| 0|
|5.1 SOFTWOOD SHIELD | 0| 128| 0|
|5.2 MIXED WOOD SHIELD | 0| 0| 257|
|5.3 ATLANTIC HIGHLANDS | 0| 199| 0|
|5.4 BOREAL PLAIN | 0| 125| 0|
|6.1 BOREAL CORDILLERA | 0| 62| 0|
|6.2 WESTERN CORDILLERA | 0| 0| 449|
|7.1 MARINE WEST COAST FOREST | 110| 0| 0|
|8.1 MIXED WOOD PLAINS | 411| 0| 0|
|8.2 CENTRAL USA PLAINS | 0| 0| 149|
|8.3 SOUTHEASTERN USA PLAINS | 582| 0| 0|
|8.4 OZARK/OUACHITA-APPALACHIAN FORESTS | 0| 0| 309|
|8.5 MISSISSIPPI ALLUVIAL AND SOUTHEAST USA COASTAL PLAINS | 0| 222| 0|
|9.2 TEMPERATE PRAIRIES | 266| 0| 0|
|9.3 WEST-CENTRAL SEMIARID PRAIRIES | 0| 0| 256|
|9.4 SOUTH CENTRAL SEMIARID PRAIRIES | 297| 0| 0|
|9.5 TEXAS-LOUISIANA COASTAL PLAIN | 0| 42| 0|
|9.6 TAMAULIPAS-TEXAS SEMIARID PLAIN | 0| 20| 0|

So each of the 32 level-2 ecoregions is randomly assigned as a whole to the training, validation or test set (the code for this can be found in the 03-extract-route-feature.R file on GitHub). We can also look at the training, validation and test routes on a map.

clean_routes %>% 
  mutate(group = factor(group , levels = c("train" , "validation" , "test"))) %>% 
  ggplot(aes(x = Longitude , y = Latitude)) +
  geom_point(aes(color = group) , shape = 1) +
  coord_sf(crs = st_crs(4326))


How do I subset the dataset by routes?

We discussed last week (16.12) that it is better to randomly remove blocks of data instead of randomly removing individual data points. Now I am wondering which kind of block I should consider when I subset the data, the training data in particular. I thought of two possibilities:

by state

The first one is to subset the data by state. There are 61 states in the dataset. Maybe I can train the model with only some randomly selected states.

clean_routes %>% 
  tidyr::separate(route_id , c("id1" , "id2") , sep = "_") %>% 
  # only the training set
  dplyr::filter(group == "train") %>% 
  # randomly select 1/4 of the states in the training set
  # (sample from unique(.$id1); sampling the raw column would weight states by their route counts)
  dplyr::filter(id1 %in% sample(unique(.$id1) , size = round(length(unique(.$id1)) * 1/4) , replace = FALSE )) %>% 
  # include the validation and test sets
  dplyr::union({
    clean_routes %>% 
      tidyr::separate(route_id , c("id1" , "id2") , sep = "_") %>% 
      dplyr::filter(group != "train")
  }) %>% 
  mutate(subset = "1/4 training") %>% 
  # the original dataset without subsetting for visualization
  dplyr::union({
    clean_routes %>% 
      tidyr::separate(route_id , c("id1" , "id2") , sep = "_") %>% 
      mutate(subset = "original")
  }) %>% 
  # re-ordering for visualization
  mutate(group = factor(group , levels = c("train" , "validation" , "test"))) %>% 
  # plot
  ggplot(aes(x = Longitude , y = Latitude)) +
  geom_point(aes(color = id1) , shape = 1) +
  facet_grid(subset~group) +
  coord_sf(crs = st_crs(4326)) +
  theme(legend.position = "bottom") +
  labs(color = "state")
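A side note on the pipeline above: dplyr::union() removes duplicated rows, which is safe here only because the subset column differs between the two halves; plain row-stacking (dplyr::bind_rows()) states the intent more directly and skips the deduplication. A base-R sketch of the difference:

```r
a <- data.frame(x = 1:2, subset = "1/4 training")
b <- data.frame(x = 1:2, subset = "original")
stacked <- rbind(a, b)      # stack rows (what bind_rows() does)
deduped <- unique(stacked)  # a set union additionally drops duplicated rows
nrow(stacked) == nrow(deduped)  # TRUE only because `subset` distinguishes the halves
```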

The idea of subsetting by state is illustrated in the graph: in the upper panels, only 1/4 (for example) of the states in the training data remain, and I would train the three models only with the data from these fewer states.

I think this way of subsetting the data makes sense in some way. Of course I am removing blocks of routes instead of random ones. Also, I can connect it to some imagined scenarios, like Trump deciding to freeze the federal budget so that only 1/4 of the states are able to continue with the survey (although Canada is also included in the project and the surveys are carried out by volunteers anyway XD). However, state is a political concept and does not necessarily carry any ecological information.

by ecoregion

The other way is to subset the data by ecoregion. That is, only the data from some ecoregions of the training set remains and is used to train the model. The idea is illustrated in this graph:

clean_routes %>% 
  # only the training set
  dplyr::filter(group == "train") %>% 
  # randomly select 1/4 of the ecoregions in the training set
  # (sample from unique(.$L2_KEY) so each ecoregion is equally likely)
  dplyr::filter(L2_KEY %in% sample(unique(.$L2_KEY) , size = round(length(unique(.$L2_KEY)) * 1/4) , replace = FALSE )) %>% 
  # include the validation and test sets
  dplyr::union({
    clean_routes %>% 
      dplyr::filter(group != "train")
  }) %>% 
  mutate(subset = "1/4 training") %>% 
  # the original dataset without subsetting for visualization
  dplyr::union({
    clean_routes %>% 
      mutate(subset = "original")
  }) %>% 
  # re-ordering for visualization
  mutate(group = factor(group , levels = c("train" , "validation" , "test"))) %>% 
  # simplification for visualization (str_extract returns a vector; str_match would return a matrix column)
  mutate(L2_KEY = stringr::str_extract(L2_KEY , "\\d+\\.\\d+")) %>% 
  # plot
  ggplot(aes(x = Longitude , y = Latitude)) +
  geom_point(aes(color = L2_KEY) , shape = 1) +
  facet_grid(subset~group) +
  coord_sf(crs = st_crs(4326)) +
  theme(legend.position = "bottom") +
  labs(color = "Ecoregion")


My questions

My first question would be: which of the two ways of subsetting (by state or by ecoregion) makes more sense? My personal preference is by state because I think it’s more interesting.

The second question is: I should only subset the training data, but use the same (original) validation set to validate the models trained with data of different sizes, right? For example, if I have two models, one trained with 1/4 of the states and the other trained with all the states, I should still validate both on the original validation set to make the results comparable, right? (It’s like some “extrapolation” for the model trained with fewer states.)


20201221 re-assign the train/validation/test set by state

Following the discussion above, here is a draft that re-assigns whole states to the partitions: each state code is sampled into one group, so all routes of a state land in the same set.

set.seed(1221)
clean_routes %>% 
  tidyr::separate(route_id , c("state" , "id2") , sep = "_") %>% 
  left_join({
  clean_routes %>% 
    tidyr::separate(route_id , c("state" , "id2") , sep = "_") %>% 
    distinct(state) %>% 
    mutate(group_by.state = sample(c("train" , "validation" , "test") , 
                                   size = nrow(.) ,
                                   replace = TRUE , prob = c(1/3,1/3,1/3)))
  } , by = "state") %>% 
  # re-ordering for visualization
  mutate(group_by.state = factor(group_by.state , levels = c("train" , "validation" , "test"))) %>%
  ggplot(aes(x = Longitude , y = Latitude , color = state)) +
  geom_point(shape = 1) +
  facet_grid(~group_by.state) +
  coord_sf(crs = st_crs(4326)) +
  theme(legend.position = "bottom")
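One property worth checking after such a re-assignment is that every state really maps to exactly one group. A toy check with made-up assignments:

```r
assigned <- data.frame(state = c("11", "11", "02", "02"),
                       group = c("train", "train", "test", "test"))
# every state should have exactly one distinct group
all(tapply(assigned$group, assigned$state,
           function(g) length(unique(g))) == 1)  # TRUE
```

Also note that because states (not routes) are sampled with equal probability, the resulting numbers of routes per group will be uneven, since states differ a lot in how many routes they contain.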